Circuits and methods for high-capacity asynchronous pipeline

ABSTRACT

A latchless dynamic asynchronous digital pipeline circuit provides decoupled control of pull-up and pull-down. Using two decoupled inputs, a stage is driven through three distinct phases in sequence: evaluate, isolate and precharge. In the isolate phase, a stage holds its outputs stable irrespective of any changes at its inputs. Adjacent pipeline stages are capable of storing distinct data items without spacers.

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims priority to U.S. Provisional Patent Application entitled “Fine-Grain Pipelined Asynchronous Adders for High-Speed DSP Applications,” Serial No. 60/199,439, which was filed on Apr. 25, 2000, and which is incorporated by reference in its entirety herein.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] This invention relates to circuits and methods for asynchronous pipeline processing, and more particularly to pipelines providing high buffering and high throughput.

[0004] 2. Background of the Related Art

[0005] There has been increasing demand for pipeline designs capable of multi-GigaHertz throughputs. Several novel synchronous pipelines have been developed for these high-speed applications. For example, in wave pipelining, multiple waves of data are propagated between two latches. However, this approach requires significant design effort, from the architectural level down to the layout level, for accurate balancing of path delays (including data-dependent delays), yet such systems remain highly vulnerable to process, temperature and voltage variations. Other aggressive synchronous approaches include clock-delayed domino, skew-tolerant domino, and self-resetting circuits. These approaches require complex timing constraints and lack elasticity. Moreover, high-speed global clock distribution for these circuits remains a major challenge.

[0006] Asynchronous design, which replaces global clocking with local handshaking, has the potential to make high speed design more feasible. Asynchronous pipelines avoid the issues related to the distribution of a high-speed clock, e.g., wasteful clock power and management of clock skew. Moreover, the absence of a global clock imparts a natural elasticity to the pipeline since the number of data items in the pipeline is allowed to vary over time. Finally, the inherent flexibility of asynchronous components allows the pipeline to interface with varied environments operating at different rates; thus, asynchronous pipeline styles are useful for the design of system-on-a-chip.

[0007] One prior art pipeline is Williams' PS0 dual-rail asynchronous pipeline (T. Williams, Self-Timed Rings and Their Application to Division, Ph.D. Thesis, Stanford University, June 1991; T. Williams et al., “A Zero-Overhead Self Timed 160 ns 54 b CMOS Divider,” IEEE JSSC, 26(11):1651-1661, November 1991). FIG. 1 illustrates Williams' PS0 pipeline 10. Each pipeline stage 12 a, 12 b, 12 c is composed of a dual-rail function block 14 a, 14 b, 14 c and a completion detector 16 a, 16 b, 16 c. The completion detectors indicate validity or absence of data at the outputs of the associated function block.

[0008] Each function block 14 a, 14 b, 14 c is implemented using dynamic logic. A precharge/evaluate control input, PC, of each stage is tied to the output of the next stage's completion detector. For example, the precharge/evaluate control input, PC, of stage 12 a is tied to the completion detector 16 b of stage 12 b and is passed to function block 14 a on line 18 a. Since a precharge logic block can hold its data outputs even when its inputs are reset, it also provides the functionality of an implicit latch. Therefore, a PS0 stage has no explicit latch. FIG. 2(a) illustrates how a dual-rail AND gate, for example, would be implemented in dynamic logic; the dual-rail pair, f₁ and f₀, implements the AND of the dual-rail inputs a₁a₀ and b₁b₀.

[0009] The completion detector 16 a, 16 b, 16 c at each stage 12 a, 12 b, 12 c, respectively, signals the completion of every computation and precharge. Validity, or non-validity, of data outputs is checked by OR'ing the two rails for each individual bit, and then using a C-element to combine all the results (see FIG. 2(b)). A C-element is a basic asynchronous stateholding element. More particularly, the output of an n-input C-element is high when all inputs are high, is low when all inputs are low, and otherwise holds its previous value. It is typically implemented by a CMOS gate with a series stack in both pull-up and pull-down, and an inverter on the output (with a weak feedback inverter attached to maintain state).
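By way of illustration only, the behavior of a C-element and of the dual-rail completion check described above may be sketched in software as follows. The names CElement, bit_valid and completion are hypothetical and do not appear in the figures; the sketch models logical behavior only, not transistor-level timing, and assumes that both rails of a bit are never high at the same time.

```python
class CElement:
    """Behavioral model of an n-input C-element (hypothetical helper)."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, *inputs):
        if all(inputs):
            self.output = 1          # all inputs high: output goes high
        elif not any(inputs):
            self.output = 0          # all inputs low: output goes low
        # otherwise the inputs disagree and the previous value is held
        return self.output


def bit_valid(rail1, rail0):
    """OR of the two rails of one dual-rail bit: 1 = data present, 0 = reset."""
    return rail1 | rail0


def completion(c_element, bits):
    """Combine the per-bit validity signals with a C-element."""
    return c_element.update(*(bit_valid(r1, r0) for r1, r0 in bits))
```

With two dual-rail bits, the modeled detector output rises only after both bits carry data and falls only after both bits have been reset; in between, it holds its previous value.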

[0010] The sequencing of pipeline control for the Williams' PS0 dual-rail pipeline is as follows: Stage N is precharged when stage N+1 finishes evaluation. Stage N evaluates when stage N+1 finishes reset. Actual evaluation will commence only after valid data inputs have also been received from stage N−1. This protocol ensures that consecutive data tokens are always separated by reset tokens or spacers.

[0011] The complete cycle of events for a pipeline stage is derived by observing how a single data token flows through an initially empty pipeline. The sequence of events from one evaluation by stage 12 a, to the next is: (i) Stage 12 a evaluates, then (ii) stage 12 b evaluates, then (iii) stage 12 b's completion detector 16 b detects completion of evaluation, and then (iv) stage 12 a precharges. At the same time, after completing step (ii), (iii)′ stage 12 c evaluates, then (iv)′ stage 12 c's completion detector 16 c detects completion of evaluation, and initiates the precharge of stage 12 b, then (v) stage 12 b precharges, and finally, (vi) stage 12 b's completion detector 16 b detects completion of precharge, thereby releasing the precharge of stage 12 a and enabling stage 12 a to evaluate once again. Thus, there are six events in the complete cycle for a stage, from one evaluation to the next.

[0012] The complete cycle for a pipeline stage, traced above, consists of 3 evaluations, 2 completion detections and 1 precharge. The analytical pipeline cycle time, T_(PS0), therefore is:

T_(PS0) = 3·t_(Eval) + 2·t_(CD) + t_(Prech)  (1)

[0013] where t_(Eval) and t_(Prech) are the evaluation and precharge times for each stage, and t_(CD) is the delay through each completion detector.

[0014] The per-stage forward latency, L, is defined as the time it takes the first data token, in an initially empty pipeline, to travel from the output of one stage to the output of the next stage. For PS0, the forward latency is simply the evaluation delay of a stage:

L_(PS0) = t_(Eval)  (2)

[0015] A disadvantage of this type of latch-free asynchronous dynamic pipeline (e.g., PS0) is that alternating stages usually must contain “spacers,” or “reset tokens,” limiting the pipeline capacity to 50%. Another disadvantage of the Williams pipeline is that it requires a number of synchronization points between stages. Moreover, Williams' pipeline maintains data integrity by constraining the interaction of pipeline stages, i.e., the precharge and evaluation of a stage are synchronized with specific events in neighboring stages.

[0016] Three recent, competitive asynchronous pipelines provide improved performance but suffer from numerous disadvantages which have been removed by the digital signal processing pipeline apparatus in accordance with the invention.

[0017] A design by Renaudin provides high storage capacity (M. Renaudin et al., “New Asynchronous Pipeline Scheme: Application to the Design of a Self-Timed Ring Divider,” IEEE JSSC, 31(7):1001-1013, July 1996). Renaudin's pipelines achieve 100% capacity without extra latches or “identity stages.” Their approach locally manipulates the internal structure of the dynamic gate in order to provide increased capacity.

[0018] However, there are two significant disadvantages of Renaudin's pipelines. First, in Renaudin's pipelines, extra latching is achieved by modifying the output inverter of each dynamic gate into a gated inverter, through the use of additional transistors. A second disadvantage of Renaudin's pipelines is a relatively low throughput. In particular, Renaudin's pipelines are based on a much more conservative form of PS0 pipelines, called PC0. Consequently, their throughput, while an improvement over PC0, is worse than even that of PS0.

[0019] The two FIFO designs by Molnar et al., the asp* FIFO and the micropipelined FIFO, are among the most competitive pipelines presented in literature, with reported throughputs of 1.1 Giga and 1.7 Giga items/second in 0.6 μm CMOS (C. Molnar et al., “Two FIFO Ring Performance Experiments,” Proceedings of the IEEE, 87(2):297-307, February 1999).

[0020] Molnar's first FIFO, asp*, has significant drawbacks. When processing logic is added to the pipeline stages, the throughput of the asp* FIFO is expected to significantly degrade relative to the pipeline designs described herein. This performance loss occurs because the asp* FIFO requires explicit latches to separate logic blocks. The latches are essential to the design; they ensure that the protocol will not result in data overruns. As a result, in asp*, with combinational logic distinct from latches, the penalty of logic processing can be significant. In addition, the asp* FIFO has complex timing assumptions which have not been explicitly formalized; in fact, an early version was unstable due to timing issues.

[0021] Molnar's second design, the micropipelined FIFO, also has several shortcomings. First, the micropipeline is actually composed of two parallel “half-rate” FIFO's, each providing only half of the total throughput (0.85 Giga items/second); thus, the net throughput of 1.7 Giga items/second is achieved only at a significant cost in area. Second, the micropipelined FIFO uses very expensive transition latches. Another limitation of the micropipelined FIFO is that it cannot perform logic processing at all; i.e., it can only be used as a FIFO. The reason for this restriction is that it uses a complex latch structure in which parts of each latch are shared between adjacent stages. As a result, insertion of logic blocks between latches is not possible.

[0022] Among the fastest designs reported in literature are the IPCMOS pipelines, with throughputs of 3.3-4.5 GHz in a 0.18 μm CMOS process (S. Schuster et al., “Asynchronous Interlocked Pipelined CMOS Circuits Operating at 3.3-4.5 GHz,” Proceedings ISSCC, February 2000). IPCMOS has disadvantages at the circuit as well as at the protocol levels. First, IPCMOS uses large and complex control circuits which have significant delays. Second, IPCMOS makes use of extremely aggressive circuit techniques, which require a significant effort of design and verification. For example, one of the gates in their “strobe” circuit potentially may have a short circuit through its pull-up and pull-down stacks, depending on the relative arrival times of inputs to the two stacks from multiple data streams. Their approach relies on a ratioing of the stacks to ensure correct output. Third, in IPCMOS, pipeline stages are enabled for evaluation only after the arrival of valid data inputs. Hence, the forward latency of a stage is poor, because of the delay to precharge-release the stage.

[0023] It is an object of the invention to provide high throughput and high storage capacity through decoupling the controls of precharge and evaluation. It is another object to reduce the need for a “reset” spacer between adjacent data tokens to increase storage capacity.

[0024] It is an object of the invention to provide an asynchronous pipeline having protocols wherein no explicit latches are required.

[0025] It is an object of the invention to provide an asynchronous pipeline having simple one-sided timing constraints, which may be easily satisfied.

[0026] It is an object of the invention to provide an asynchronous pipeline having function blocks that may be enabled for evaluation before the arrival of data. Thus, data insertion in an empty pipeline can ripple through each stage in succession.

[0027] It is a further object to provide an asynchronous pipeline having high data integrity, wherein a stage may hold its outputs stable irrespective of any changes in its inputs.

[0028] It is yet another object of the invention to provide an asynchronous pipeline having reduced critical delays, smaller chip area, lower power consumption, and simple, small and fast control circuits to reduce overhead.

[0029] It is another object of the invention to provide an asynchronous pipeline capable of merging multiple input data streams.

SUMMARY OF THE INVENTION

[0030] These and other objects of the invention are accomplished in accordance with the principles of the invention through an asynchronous digital pipeline circuit which allows a much denser packing of data tokens in the pipeline, thus providing higher storage, or buffering, capacity. Other beneficial features include low forward latency and easily-satisfiable one-sided timing constraints.

[0031] An asynchronous digital pipeline circuit, having latchless dynamic logic, has a first processing stage configured to be driven through a cycle of phases consisting of a first precharge phase, followed by a first evaluate phase, followed by a first isolate phase. In the first isolate phase, the output of the first processing stage is isolated from changes in the input thereof, but maintains the value of stored data at its outputs. The first processing stage is responsive to a first precharge control signal and a first evaluate control signal in order to pass through the three cycles of operation. A first stage controller is responsive to a transition signal and provides the first and second decoupled control signals to the first processing stage.

[0032] A second processing stage is configured to be driven through a cycle of phases consisting of a second precharge phase, followed by a second evaluate phase, followed by a second isolate phase, solely in response to a second precharge control signal and a second evaluate control signal. The second processing stage provides a transition signal indicative of the phase thereof. An interconnection is provided between the first processing stage and the second processing stage such that reception of the transition signal by the first stage controller enables the first processing stage to cycle through the precharge phase, the evaluate phase, and the isolate phase while the second processing stage remains in one of the evaluate phase and the isolate phase. Under these circumstances, the first processing stage and the second processing stage are able to store different data tokens without separation by a spacer.

[0033] A single explicit synchronization point is provided between the first processing stage and the second processing stage. When the transition signal indicative of the phase of the second processing stage is asserted, the first processing stage is enabled to begin the cycle of precharge, evaluate, and isolate. This single explicit synchronization point increases the concurrency of operation. When the transition signal indicative of the phase of the second processing stage is de-asserted, however, there is no command to change the phase of the first processing stage.

[0034] Further features of the invention, its nature and various advantages will be more apparent from the accompanying drawings and the following detailed description of the preferred embodiments.

BRIEF DESCRIPTION OF THE INVENTION

[0035] FIG. 1 is an illustration of a prior art pipeline.

[0036] FIG. 2(a) is an illustration of the circuit of a function block of the prior art pipeline of FIG. 1.

[0037] FIG. 2(b) is an illustration of a completion detector of the prior art pipeline of FIG. 1.

[0038] FIG. 3 is a block diagram of an asynchronous digital pipeline circuit in accordance with the invention.

[0039] FIG. 4 is a schematic diagram of a gate of the block diagram of FIG. 3 in accordance with the invention.

[0040] FIG. 5 illustrates a sequence of phases of and interaction of the stages of the asynchronous digital pipeline circuit in accordance with the invention.

[0041] FIG. 6(a) illustrates an exemplary Petri-net specification of an exemplary pipeline stage controller.

[0042] FIG. 6(b) illustrates a Petri-net specification of the pipeline stage controller in accordance with the invention.

[0043] FIG. 7(a) is a logic diagram of a stage controller of the asynchronous digital pipeline circuit illustrated in FIG. 3 in accordance with the invention.

[0044] FIG. 7(b) is a circuit diagram of a portion of the stage controller illustrated in FIG. 7(a) in accordance with the invention.

[0045] FIG. 8 is a block diagram of an alternative embodiment of a pipeline stage of an asynchronous digital pipeline circuit in accordance with the invention.

[0046] FIG. 9 is a simplified block diagram of an exemplary embodiment in accordance with the invention.

[0047] FIG. 10 is a block diagram of a portion of the embodiment of FIG. 9 in accordance with the invention.

[0048] FIG. 11 is a simplified block diagram of another embodiment in accordance with the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0049] The asynchronous digital pipeline circuit in accordance with the invention decouples the control of pull-up and pull-down in each processing stage. A dynamic gate is controlled by two separate inputs, pc and eval. Using these inputs, a stage is driven through three distinct phases in sequence: evaluate, isolate and precharge, as will be described in greater detail herein. In the isolate phase, a stage holds its output stable irrespective of any changes at its inputs. As a result, adjacent pipeline stages are capable of storing distinct data items, thus obtaining 100% storage capacity.

[0050] FIG. 3 illustrates a simplified block diagram of a pipeline 100 in accordance with the invention. Three exemplary stages 102 a/102 b/102 c are depicted, although it is contemplated that there may be a greater or fewer number of stages. Each stage 102 a/102 b/102 c may comprise three components: a function block 104 a/104 b/104 c, a completion generator 106 a/106 b/106 c and a stage controller 108 a/108 b/108 c, respectively. Each function block 104 a/104 b/104 c alternately produces data tokens and reset spacers for the next stage, and its completion generator 106 a/106 b/106 c indicates completion of the stage's evaluation or precharge. The third component, the stage controller 108 a/108 b/108 c, generates the pc and eval signals which control the respective function block 104 a/104 b/104 c and the completion generator 106 a/106 b/106 c. These components are discussed in greater detail below.

[0051] A commonly-used asynchronous scheme, called bundled data, is used to implement the single-rail asynchronous datapath. More particularly, a control signal, Req, on line 110 a/110 b/110 c indicates arrival of new inputs to a respective stage 102 a/102 b/102 c. For example, the signal Req on line 110 b is an input to the completion generator 106 b, and an output from completion generator 106 a. A high value of Req indicates the arrival of new data, i.e., the previous stage has completed evaluation. On the other hand, a low value of Req indicates the arrival of a spacer, i.e., the previous stage has completed precharge. For correct operation, a simple timing constraint must be satisfied: Req must arrive after the data inputs to the stage are stable and valid. This requirement is met by inserting a “matched delay” element 112 a/112 b/112 c that provides a delay which is greater than or equal to the worst-case delay through the function block 104 a/104 b/104 c. An advantage of this approach is that the datapath itself can be built using standard single-rail (synchronous style) function blocks.

[0052] There are several common ways to implement a matched delay, such as matched delay element 112 a/112 b/112 c. One preferred technique is to use an inverter chain, as is known in the art. Alternatively, a chain of transmission gates may be used, wherein the number of gates and their transistor sizing determines the total delay. An alternative technique duplicates the worst-case critical path of the logic block, and uses that as a delay line. Bundled data has been widely used, including in commercial asynchronous Philips 80C51 microcontroller chips.

[0053] A portion of a function block 104 a/104 b/104 c is illustrated in greater detail in FIG. 4. More particularly, FIG. 4 shows one gate of a function block 104 in a pipeline stage. (When the suffix a, b, or c has been omitted, the features described are common to all components having the same reference number.) The pc input on line 114 controls the pull-up stack 115 (i.e., the precharge) and the eval input on line 116 controls the “foot” of the pull-down stack 117. Precharge occurs when pc is asserted low and eval is de-asserted low. Evaluation occurs when eval is asserted high and pc is de-asserted high. When both signals are de-asserted, the gate output is effectively isolated from the gate inputs; thus, it enters the “isolate phase.” To avoid a short circuit, pc and eval are never simultaneously asserted.
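The three phases selected by pc and eval may be summarized, by way of a hedged behavioral sketch, as follows. The names gate_phase and DynamicGate are hypothetical; the model tracks only the dynamic node of one gate (any output inverter is omitted), and the argument pulldown_conducts stands in for whatever logic function the pull-down stack 117 implements.

```python
def gate_phase(pc, eval_):
    """Map the two control inputs to a phase.

    pc is active-low (0 = precharge asserted); eval_ is active-high.
    """
    if pc == 0 and eval_ == 1:
        # Both asserted: pull-up and pull-down would conduct together.
        raise ValueError("pc and eval must never be asserted simultaneously")
    if pc == 0:
        return "precharge"
    if eval_ == 1:
        return "evaluate"
    return "isolate"


class DynamicGate:
    """Behavioral model of the dynamic node of one gate (assumed polarity)."""

    def __init__(self):
        self.node = 1  # assume the node starts precharged high

    def step(self, pc, eval_, pulldown_conducts):
        phase = gate_phase(pc, eval_)
        if phase == "precharge":
            self.node = 1
        elif phase == "evaluate" and pulldown_conducts:
            self.node = 0
        # isolate: the node keeps its value whatever the data inputs do
        return self.node
```

In this model, once the gate has evaluated and both controls are de-asserted, later changes of pulldown_conducts have no effect on the stored value until the next precharge.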

[0054] With continued reference to FIG. 3, the completion generator 106 is an asymmetric C-element, aC. An asymmetric C-element typically has three types of inputs: those that are marked “+”, those marked “−”, and a third type that is unmarked. The output of the aC is set high when all the unmarked inputs and all the “+” inputs go high. Conversely, the aC output is reset low when all the unmarked inputs and all the “−” inputs go low. For all other combinations, the aC holds its output value. Completion generator 106 has a positive input eval 116 and a negative input pc 114 from the stage controller 108, and a positive input Req from the output of the previous stage.

[0055] In the pipeline 100 in accordance with the invention, the output of the completion generator 106, Done, is placed on line 120. The output Done is set high when the stage 102 has begun to evaluate, i.e., when two conditions occur: (1) the stage 102 has entered its evaluate phase, i.e., eval is high, and (2) the previous stage has supplied valid data input, i.e., completion signal Req of the previous stage is high. Done is reset simply when the stage is enabled to precharge, i.e., pc is asserted low. Thus, a stage's precharge will immediately reset Done, while evaluate will only set Done if the stage is in evaluation and valid data inputs have also arrived.
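The set and reset conditions of the completion generator may be sketched behaviorally as follows. The class name CompletionGenerator is hypothetical; the sketch follows the input markings stated above (eval and Req as “+” inputs, pc as the “−” input, no unmarked inputs) and ignores gate delays.

```python
class CompletionGenerator:
    """Behavioral model of the asymmetric C-element producing Done."""

    def __init__(self):
        self.done = 0

    def update(self, pc, eval_, req):
        if eval_ == 1 and req == 1:
            self.done = 1   # all "+" inputs high: set Done
        elif pc == 0:
            self.done = 0   # the "-" input low (precharge asserted): reset Done
        # any other combination: hold the previous value
        return self.done
```

Because pc and eval are never asserted together, the set and reset conditions cannot both hold in this model; during the isolate phase (pc high, eval low) Done simply keeps its value.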

[0056] The output of the completion generator 106 on line 120 is fed through the matched delay element 112, which (when combined with the completion generator) matches the worst-case path through the function block 104. Typically, for extremely fine-grain or “gate-level” pipelines, the matched delay may be unnecessary, because the delay of the aC in the completion generator 106 often already matches the delay of the function block 104.

[0057] Finally, the completion signal Done on line 120 is divided three ways and fed to three components: (i) the previous stage's controller 108 on line 122, indicating the current stage's state, e.g., on line 122 b to stage controller 108 a; (ii) the current stage's stage controller 108, e.g., on line 124 b to stage controller 108 b (through the matched delay element 112 b); and (iii) the next stage's completion generator 106, e.g., on line 110 c to completion generator 106 c (through the matched delay element 112 b).

[0058] With continued reference to FIG. 3, the stage controller 108 produces control signals pc and eval for the function block 104 and the completion generator 106. The stage controller 108 itself receives two inputs: (1) the delayed Done signal of the current stage on line 124 (i.e., Req), henceforth referred to as signal S, e.g., signal S may arrive at stage controller 108 b on line 124 b, and (2) the Done signal of the next stage, henceforth referred to as signal T, e.g., signal T arrives at stage controller 108 b on line 122 c. The stage controller 108 produces the two decoupled control signals, pc and eval. Details of the stage controller's protocol and implementation will be described in greater detail herein.

[0059] Each stage 102 in pipeline 100 cycles through three phases, as illustrated in FIG. 5. Cycle 200 a is illustrated for a first Stage N, and cycle 200 b is illustrated for adjacent stage N+1. After Stage N completes its evaluate phase 202 a, it then enters its isolate phase 204 a and typically does not proceed to the precharge phase 206 a until it receives a signal from stage N+1, as will be described below. As soon as the precharge phase 206 a is complete, it re-enters the evaluate phase 202 a again, completing the cycle. (Stage N+1 likewise passes through evaluate 202 b, isolate 204 b, and precharge 206 b phases as indicated in dotted line.)

[0060] There is one explicit synchronization point, or interconnection, between stages N and N+1. As illustrated by dotted line 210, once a stage N+1 has completed its evaluate phase 202 b, it enables the previous stage N to perform its entire next cycle: i.e., precharge phase 206 a, evaluation phase 202 a, and isolate phase 204 a for the new data item. There is also one implicit synchronization point as illustrated by dotted line 211: the dependence of stage N+1's evaluation phase 202 b on its predecessor stage N's evaluation phase 202 a. A stage cannot produce new data until it has received valid inputs from its predecessor. Both of the synchronization points are shown by the causality arcs in FIG. 5.

[0061] Once a stage finishes evaluation, it isolates itself from its inputs by a self-resetting operation. The stage enters the isolate phase regardless of whether this stage is allowed to enter its precharge phase. As a result, the previous stage can not only precharge, but even safely evaluate the next data token, since the current stage will remain isolated. For example, when stage N+1 completes its evaluate phase 202 b, it enters the isolate phase 204 b while stage N may precharge 206 a and evaluate 202 a without affecting the output of stage N+1.

[0062] There are two benefits of this protocol: (a) higher throughput, since a stage N can evaluate the next data item even before stage N+1 has begun to precharge; and (b) higher capacity for the same reason, since adjacent pipeline stages are now capable of simultaneously holding distinct data tokens, without requiring separation by spacers.
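The capacity benefit may be illustrated with a token-level abstraction, offered only as a rough sketch. The function fill_blocked_pipeline and its bookkeeping are hypothetical: each stage is reduced to the token held at its outputs plus a flag recording whether the next stage has already evaluated (absorbed) that token, and the final stage's output is assumed to be blocked so that nothing ever leaves the pipeline. Signal-level behavior and timing are not modeled.

```python
def fill_blocked_pipeline(num_stages, tokens):
    """Push tokens into a pipeline whose last stage is never acknowledged."""
    held = [None] * num_stages       # token at each stage's outputs
    absorbed = [True] * num_stages   # True: empty, or next stage has taken the token
    source = list(tokens)
    progress = True
    while progress:
        progress = False
        for i in range(num_stages - 1, -1, -1):
            if not absorbed[i]:
                continue             # stage i is isolated, holding an unabsorbed token
            if i == 0:
                if source:           # first stage takes the next input token
                    held[0] = source.pop(0)
                    absorbed[0] = False
                    progress = True
            elif not absorbed[i - 1]:
                # Stage i precharges and evaluates the predecessor's token,
                # which frees the predecessor for its own next cycle.
                held[i] = held[i - 1]
                absorbed[i] = False
                absorbed[i - 1] = True
                progress = True
    return held


stalled = fill_blocked_pipeline(4, ["d0", "d1", "d2", "d3", "d4", "d5"])
```

Under these assumptions, the four-stage pipeline stalls with every stage holding a different token and no spacers between them, which is the 100% capacity behavior described above.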

[0063] A formal specification of the stage controller is given in FIG. 6(a) in the form of a Petri-net (a well-known graphical representation commonly used to describe concurrent behaviors). It consists of transitions, indicated by labeled events, and places, which store tokens which are indicated by black dots. A transition fires when all of its incoming arcs have tokens, which are then deposited on all of its outgoing arcs. (Further details concerning Petri-nets are discussed in Tadao Murata, “Petri Nets: Properties, Analysis and Applications,” Proceedings of the IEEE, 77(4), April 1989; L. Y. Rosenblum and A. V. Yakovlev, “Signal Graphs: From Self-Timed to Timed Ones,” Proceedings of International Workshop on Timed Petri Nets, Torino, Italy, pp. 199-207, July 1985; and Tam-Anh Chu, “On the Models for Designing VLSI Asynchronous Digital Circuits,” Integration, the VLSI Journal, 4(2):99-113, June 1986, which are incorporated by reference in their entirety herein.)

[0064] A Petri-net specification for the stage controller 108 can be deduced from the sequence of phases in a stage cycle, as illustrated with respect to FIG. 5, above. The controller of stage N has two inputs, S and T, which are the Done outputs of stage N and stage N+1 respectively (see FIG. 3), and it has two outputs, pc and eval, which drive stage N.

[0065] FIG. 6(a) illustrates a Petri-net for a preliminary design of a stage controller. The specification shown in FIG. 6(a) presents several shortcomings, as detailed herein. The enabling condition for the precharge of stage N at 302 is ambiguous: stage N has completed evaluation of a data item and is entering the isolate phase at 304 (signal S 124 is high), and stage N+1 has evaluated the same data item at 306 (signal T 122 is high). A problem arises if stage N+1 is blocked or slow: it may continue to maintain its high T 122 output while stage N processes an entire new data input (precharge, then evaluate). In this case, the signals S 124 and T 122 again are both high, but now stage N and stage N+1 have distinct tokens. In this case, since stage N+1 has not absorbed the new data, stage N must not be precharged.

[0066] A solution to this problem is obtained by adding a state variable, ok2pc 153, implemented by an asymmetric C-element in the stage controller (see FIG. 6(b)). The specification in FIG. 6(b) is substantially identical to the specification in FIG. 6(a), with the differences noted herein. The variable ok2pc 153 effectively records whether stage N+1 has absorbed a data item. As illustrated by Petri-net 350, ok2pc 153 is reset immediately after stage N precharges at 352 (signal S 124 is low), and is only set again once stage N+1 has undergone a subsequent precharge at 354 (signal T 122 is low).

[0067] FIGS. 7(a) and 7(b) show an implementation of the controller of FIG. 6(b). The implementation incorporates two input signals, T 122 and S 124, and produces three output signals, pc 114, eval 116, and ok2pc 153, each of which is implemented using a single gate. The controllers directly implement the conditions described above and in the previous subsection.

[0068] More particularly, signal eval 116 is the output of an inverter 150 on the S signal 124. With reference to FIG. 7(a) in connection with FIGS. 3 and 5, a stage 102 cycles from the evaluate phase 202 to the isolate phase 204 when eval 116 is de-asserted low, without any further inputs. In the embodiment, after stage 102 evaluates, the signal eval 116 is passed through the completion generator 106 on line 120 a, and through the matched delay element 112. The output of matched delay element 112 is the S signal on line 124. After passing through the inverter 150 of stage controller 108, eval signal 116 is de-asserted low, which allows the stage to enter the isolate phase 204.

[0069] The generation of the ok2pc signal 153 is performed by the asymmetric C element 152, illustrated in greater detail in FIG. 7(b). An inverter 156 is added to receive the T signal 122, since ok2pc 153 is set high after stage N+1 has completed evaluation and T 122 is low. Although the generation of ok2pc appears to add an extra gate delay to the control path to pc 114, the protocol of FIG. 6 performs this calculation off of the critical path, i.e., ok2pc is set in “background mode,” so that ok2pc is typically set before T 122 gets asserted. As a result, the critical path to pc 114 is only one gate delay: from input T 122 through the 3-input NAND3 gate 154, to the output pc 114.
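The controller logic described above may be sketched behaviorally as follows. The class name StageController is hypothetical, and two details are assumptions inferred from the text rather than read from FIG. 7: that the three inputs of NAND3 154 are S, T and ok2pc, and that the asymmetric C-element sets ok2pc when S is high and T is low and resets it when S is low. Gate delays are ignored.

```python
class StageController:
    """Behavioral model of the stage controller (assumed gate inputs)."""

    def __init__(self):
        self.ok2pc = 0

    def update(self, s, t):
        # Asymmetric C-element 152 (assumed marking): set on S=1, T=0; reset on S=0.
        if s == 1 and t == 0:
            self.ok2pc = 1
        elif s == 0:
            self.ok2pc = 0
        eval_ = 1 - s                                # inverter 150 on S
        pc = 0 if (s and t and self.ok2pc) else 1    # NAND3 154, pc is active-low
        return pc, eval_
```

Tracing one cycle in this model: with S=0 and T=0 the stage is enabled to evaluate; when S rises the stage isolates and ok2pc is set; when T then rises pc is asserted low; and once S falls after precharge, ok2pc is cleared, so a new evaluation (S high again) while T is still high does not trigger a second precharge.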

[0070] FIG. 8 illustrates one complete pipeline stage 401, according to another embodiment in accordance with the invention. More particularly, completion generator 106 and function block 104 of the pipeline stage 102 illustrated in FIG. 3 have been incorporated into a combined function block 405. The stage 401 also includes matched delay element 412 and stage controller 408 (indicated by the dashed line). The output 419 of function block 405 is passed to the next stage (not shown in FIG. 8). The output 420 of the completion generator (not shown) is divided into two signals: one part is passed to matched delay element 412 and the other part becomes signal 422 a, which is passed to a previous stage or to the environment. The output of the matched delay element 412 is S signal 424. The inputs to the stage controller 408, as described above with respect to stage controller 108, are the S signal 424 and the T signal 422 b from the next stage (not shown in FIG. 8).

[0071] A complete cycle of events for stage N can be traced with reference to FIGS. 3 and 5. From one evaluation by stage N to the next evaluation, the cycle consists of three operations: step (i), stage N evaluates; step (ii), stage N+1 evaluates, which in turn enables stage N's controller to assert the precharge input (pc=low) of stage N; and step (iii), stage N precharges, the completion of which, passing through stage N's controller, enables stage N to evaluate once again (eval asserted high). With reference to the reference numbers in FIGS. 3 and 5, the process proceeds as follows: at step (i), stage 102 a evaluates 202 a, advances to the isolate phase 204 a, and waits. Subsequently, at step (ii), stage 102 b evaluates 202 b, and upon completion of stage 102 b's evaluation, stage controller 108 a receives T signal 122 b and asserts pc 114 a. At step (iii), stage 102 a precharges 206 a, and is enabled to evaluate again.

[0072] As described above, no extra matched delays may be required for the gate-level pipeline, because the completion detector and other delays already match the gate's evaluate and precharge times. Then, in the notation introduced earlier, the delay of step (i) is $t_{Eval}$, the delay of step (ii) is $t_{aC}+t_{NAND3}$, and the delay of step (iii) is $t_{Prech}+t_{INV}$. Here, $t_{NAND3}$ and $t_{INV}$ are the delays through the NAND3 154 and the inverter 150, respectively, of FIG. 7(a). Thus, the analytical pipeline cycle time is:

$$T_{HC} = t_{Eval} + t_{Prech} + t_{aC} + t_{NAND3} + t_{INV} \qquad (3)$$

[0073] A stage's latency is simply the evaluation delay of the stage:

$$L_{HC} = t_{Eval} \qquad (4)$$

[0074] The pipeline 100 according to the invention requires a one-sided timing constraint for correct operation. The ok2pc signal 153 goes high once the current stage has evaluated and the next stage has precharged (S=1, T=0). Subsequently, signal T goes high as a result of evaluation by the next stage. For correct operation, the ok2pc signal must complete its rising transition before the T signal goes high:

$$t_{ok2pc\uparrow} < t_{Eval} + t_{INV} \qquad (5)$$

[0075] In practice, this constraint was very easily satisfied.

[0076] An adequate precharge width must be enforced. In this design, the constraint is partly enforced by the bundling constraint: the aC element and the (optional) matched delay, together, must have greater delay than the worst-case precharge time of the function block. Hence, the S signal to the NAND3 154 in FIG. 7(a) will be maintained appropriately.

[0077] There is an additional constraint on precharge width: the T signal to the NAND3 154 must not be de-asserted prematurely, i.e., before the precharge of the stage is complete. For example, in the case where the T signal is asserted high, stage 102 a's NAND3 154 a starts the precharge of 102 a (in FIG. 3). Concurrently, T signal 122 b will only be reset after a path through the asymmetric C-element (aC) of stage 102 c's completion generator 106 c, through the NAND3 154 b of stage controller 108 b and through the asymmetric C-element (aC) of completion generator 106 b of stage 102 b, and finally through the NAND3 154 a of stage controller 108 a:

$$t_{NAND3} + t_{Prech_N} \le t_{aC} + t_{NAND3} + t_{aC} + t_{NAND3} \qquad (6)$$

[0078] Assuming all stages are similar, this constraint becomes:

$$t_{Prech_N} \le t_{aC} + t_{aC} + t_{NAND3} \qquad (7)$$

[0079] This timing constraint is also easily satisfied.
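
As an illustrative worked step, the simplified constraint follows from Equation (6) by subtracting one $t_{NAND3}$ term from each side. Substituting the simulated delays reported in Table 1 for the adder example (used here only as sample values, on the assumption that all stages are similar) indicates the available margin:

```latex
\begin{align*}
t_{NAND3} + t_{Prech_N} &\le t_{aC} + t_{NAND3} + t_{aC} + t_{NAND3} \\
t_{Prech_N} &\le 2\,t_{aC} + t_{NAND3} \\
0.23\ \text{ns} &\le 2\,(0.26\ \text{ns}) + 0.12\ \text{ns} = 0.64\ \text{ns}
\end{align*}
```

With these sample values the precharge time is well under half of the bound.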

[0080] The inverter 150 in FIG. 7(a) is used to enable the isolation of a stage after it evaluates. The bundling constraint already ensures that the isolate phase does not start too early.

EXAMPLE

[0081] As a case study, a gate-level pipelined adder was simulated using the pipeline described herein. The example shows how multiple input streams for a pipeline stage can be merged together into a single output stream.

[0082] A 32-bit ripple-carry adder was selected, since its design is simple and amenable to very fine-grain pipelining. The adder configuration is suitable for high-throughput applications such as DSPs for multimedia processing.

[0083] FIG. 9 illustrates an exemplary stage 500 of a ripple-carry adder. Each stage of the adder is a full-adder, which has three data inputs (A 502, B 504, and carry-in $C_{in}$ 506) and two outputs (carry-out $C_{out}$ 508 and Sum 510). The logic equations are:

$$\mathit{Sum} = A \oplus B \oplus C_{in}, \text{ and} \qquad (8)$$

$$C_{out} = AB + AC_{in} + BC_{in}. \qquad (9)$$

[0084] A mixture of dual-rail and single-rail encodings is used to represent the adder datapath. Since the exclusive-or operation needs both true and complemented values of its operands, two rails are used to represent each of the data inputs, A, B and $C_{in}$, as required for dynamic logic implementation. Further, since $C_{out}$ of a stage is the $C_{in}$ of the next stage, it is also represented using two rails. Sum, on the other hand, is represented using only a single rail, since its complemented value is not needed. The entire datapath is a bundled datapath, and therefore may be regarded as single-rail, even though some of the signals are represented using two rails.

[0085] Denoting A 502, B 504, $C_{in}$ 506 and $C_{out}$ 508 by $a_1a_0$, $b_1b_0$, $c_{in_1}c_{in_0}$ and $c_{out_1}c_{out_0}$, respectively, the adder equations are written as:

$$\mathit{Sum} = (a_1 b_0 + a_0 b_1)\,c_{in_0} + (a_1 b_1 + a_0 b_0)\,c_{in_1}, \qquad (10)$$

$$c_{out_1} = a_1 b_1 + (a_1 + b_1)\,c_{in_1}, \text{ and} \qquad (11)$$

$$c_{out_0} = a_0 b_0 + (a_0 + b_0)\,c_{in_0}. \qquad (12)$$

[0086] In the embodiment, each of the three outputs, Sum, $c_{out_1}$ and $c_{out_0}$, was implemented using a single dynamic gate. Thus, each stage has only one level of logic.
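
For illustration, the dual-rail equations (10) to (12) may be checked against the single-rail full-adder definitions of equations (8) and (9) by exhaustive enumeration, as in the following Python sketch. The function names rails, full_adder_dual_rail and check_all are hypothetical, and the encoding of a bit as a (true rail, complement rail) pair is an assumption about how the subscripts 1 and 0 are to be read.

```python
from itertools import product


def rails(bit):
    """Encode a bit as (true rail, complement rail), e.g. 1 -> (1, 0)."""
    return bit, 1 - bit


def full_adder_dual_rail(a, b, cin):
    """Evaluate equations (10)-(12) on dual-rail encoded inputs."""
    a1, a0 = rails(a)
    b1, b0 = rails(b)
    c1, c0 = rails(cin)
    s = (a1 & b0 | a0 & b1) & c0 | (a1 & b1 | a0 & b0) & c1  # Eq. (10)
    cout1 = a1 & b1 | (a1 | b1) & c1                         # Eq. (11)
    cout0 = a0 & b0 | (a0 | b0) & c0                         # Eq. (12)
    return s, cout1, cout0


def check_all():
    """Compare against Sum = A xor B xor Cin and Cout = majority(A, B, Cin)."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout1, cout0 = full_adder_dual_rail(a, b, cin)
        total = a + b + cin
        if s != total % 2 or cout1 != total // 2 or cout0 != 1 - total // 2:
            return False
    return True


if __name__ == "__main__":
    print(check_all())
```

The check also confirms that the two carry rails are complementary for every input combination, which is what allows the carry-out of one stage to serve directly as the dual-rail carry-in of the next.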

[0087] Unlike the linear pipeline structures described above, the pipelined adder is a non-linear structure. A stage 500 may merge three distinct input streams, i.e., the two data operands and the carry-in. Therefore, alternative embodiments of the pipeline structures are described herein to handle multiple sources. In particular, since each full-adder stage represents a synchronization point of multiple input streams, it must have the capability to handle multiple bundled inputs (i.e., “request” signals).

[0088] The inputs A 502 and B 504 may be taken as belonging to one shared data stream with a common bundling signal $req_{ab}$ 523. The $C_{in}$ input along with carry-in $req_{c}$ 525 forms the other stream. Thus, only two input streams are assumed: data operands and carry-in. In practice, this is a reasonable assumption in many applications where operands come from the same source. If this assumption does not hold, the approach can be extended to handle three independent input streams.

[0089] FIG. 10 illustrates another embodiment of the completion generator 506. The completion generator 506 synchronizes on both the data inputs $req_{ab}$ 523 and the carry-in input $req_{c}$ 525. Each additional request signal is accommodated by adding one transistor to the pull-down stack of the asymmetric C-element 552 of the completion generator. The resulting Done output signal 520 is forked to three destinations, i.e., as “acknowledgements” to the stage that sent the carry-in and to the stage that sent the operands, and also as a “request” to the next stage.
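
The synchronizing behavior described above may be sketched behaviorally as follows. This Python model is an illustration under stated assumptions: gate delays are ignored, pc is taken as active low, Done is assumed to be set only when eval and every request input are high and to be reset when pc is asserted, and the class name CompletionGenerator is hypothetical.

```python
class CompletionGenerator:
    """Zero-delay sketch of an asymmetric C-element with several requests."""

    def __init__(self):
        self.done = False  # state held by the asymmetric C-element

    def update(self, ev, pc, *reqs):
        """Return Done for the given eval, pc (active low) and request inputs."""
        if not pc:
            # Precharge asserted: Done is reset.
            self.done = False
        elif ev and all(reqs):
            # Every series transistor conducts: eval and all requests are high.
            self.done = True
        # Otherwise the element holds its previous value.
        return self.done


if __name__ == "__main__":
    gen = CompletionGenerator()
    print(gen.update(True, True, True, False))    # operands only: not done
    print(gen.update(True, True, True, True))     # carry-in arrived: done
    print(gen.update(False, True, False, False))  # isolate: Done is held
    print(gen.update(False, False, False, False)) # precharge: Done is reset
```

Passing the requests as a variable-length argument list mirrors the statement that each additional request adds one transistor to the stack; a third independent request would simply be one more argument.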

[0090] Finally, the entire adder architecture is shown in FIG. 11. Shift registers 568 a, 568 b, and 568 c provide operand bits to each of adder stages 500 a, 500 b, and 500 c, respectively. A shift register 570 a, 570 b, 570 c is attached to each respective adder stage 500 a, 500 b, 500 c to accumulate the stream of sum bits coming out of that stage. Once all the sum bits for an addition operation are available, they can be read off in parallel, one bit from each shift register. The shift registers can themselves be built as asynchronous pipelines according to the embodiments described herein.
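
The data flow of this architecture may be illustrated functionally as in the following Python sketch. It models only the ordering of data items, not the handshaking or timing: each stage consumes its operand stream together with the carry stream produced by the preceding stage, and the sum bits are then read in parallel. The function name ripple_carry_stream, the list-based shift registers, and a carry-in of 0 into the first stage are assumptions made for the example.

```python
def ripple_carry_stream(pairs, width=32):
    """Add each (a, b) pair using one full-adder stage per bit position."""
    # One operand shift register per stage, holding that stage's bit of
    # every operand pair in arrival order.
    operand_regs = [[((a >> i) & 1, (b >> i) & 1) for a, b in pairs]
                    for i in range(width)]
    sum_regs = []                     # one sum shift register per stage
    carry_stream = [0] * len(pairs)   # assumed carry-in of 0 for stage 0
    for reg in operand_regs:
        sums, carries = [], []
        for (a, b), cin in zip(reg, carry_stream):
            total = a + b + cin
            sums.append(total & 1)      # sum bit, kept by this stage
            carries.append(total >> 1)  # carry-out, sent to the next stage
        sum_regs.append(sums)
        carry_stream = carries
    # Read off in parallel: result k takes entry k of every sum register.
    return [sum(sum_regs[i][k] << i for i in range(width))
            for k in range(len(pairs))]


if __name__ == "__main__":
    print(ripple_carry_stream([(1, 2), (123456, 654321)]))
```

Because the final carry-out is discarded in this sketch, results wrap modulo 2 to the power of the adder width.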

[0091] The 32-bit ripple-carry adders were simulated in HSPICE using a 0.6 μm HP CMOS process with operating conditions of 3.3 V power supply and 300 K. Special care was taken to optimize the transistor sizing for high throughput. The precharge PMOS transistors in each dynamic gate had a W/L ratio of 188λ/2λ. The NMOS transistors in the evaluation stack were sized so that the effective width of the n-stack was ⅓ that of the p-stack. Furthermore, for each of the designs, it was ensured that the timing constraints were comfortably met.

TABLE 1

t_Eval   t_Prech   t_aC   t_NAND3   t_INV   Cycle time   Throughput
(ns)     (ns)      (ns)   (ns)      (ns)    (ns)         (10⁶ items/sec)
0.26     0.23      0.26   0.12      0.11    0.98         1023

Cycle time, analytical formula: t_Eval + t_Prech + t_aC + t_NAND3 + t_INV

[0092] Table 1 lists the overall cycle time as well as its breakdown into components: stage evaluation time ($t_{Eval}$), stage precharge time ($t_{Prech}$), the delay through the completion block ($t_{aC}$), as well as the delays through the control gates ($t_{NAND3}$ and $t_{INV}$). Finally, the table lists the throughput of each adder in million operations per second. The throughput of the adders was found to be 1023 million operations per second.
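
As a sketch of how the tabulated figures relate to Equations (3) and (4), the following Python fragment evaluates the cycle time, latency and throughput from the component delays of Table 1. The dictionary and function names are hypothetical; only the delay values are taken from the table.

```python
# Component delays from Table 1, in nanoseconds.
TABLE_1_DELAYS_NS = {
    "t_Eval": 0.26,
    "t_Prech": 0.23,
    "t_aC": 0.26,
    "t_NAND3": 0.12,
    "t_INV": 0.11,
}


def cycle_time_ns(d):
    """Equation (3): sum of the five component delays."""
    return d["t_Eval"] + d["t_Prech"] + d["t_aC"] + d["t_NAND3"] + d["t_INV"]


def latency_ns(d):
    """Equation (4): per-stage latency equals the evaluation delay."""
    return d["t_Eval"]


def throughput_mitems_per_sec(d):
    """Throughput in 10^6 items/sec for a cycle time given in ns."""
    return 1e3 / cycle_time_ns(d)


if __name__ == "__main__":
    print(f"cycle time = {cycle_time_ns(TABLE_1_DELAYS_NS):.2f} ns")
    print(f"latency    = {latency_ns(TABLE_1_DELAYS_NS):.2f} ns")
    print(f"throughput = {throughput_mitems_per_sec(TABLE_1_DELAYS_NS):.0f} M items/sec")
```

Evaluated with the rounded delays, the formula gives 0.98 ns and roughly 1020 million items per second; the slightly higher figure of 1023 in Table 1 presumably reflects unrounded simulation values.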

[0093] It will be understood that the foregoing is only illustrative of the principles of the invention, and that various modifications can be made by those skilled in the art without departing from the scope and spirit of the invention.

What is claimed is:
 1. A latchless dynamic asynchronous digital pipeline circuit comprising: a first processing stage configured to be driven through a cycle of phases consisting of a first precharge phase, followed by a first evaluate phase, and followed by a first isolate phase, solely in response to a first precharge control signal and a first evaluate control signal, wherein the output of the first processing stage is isolated from changes in the input thereof when in the first isolate phase and wherein the first precharge control signal is decoupled from the first evaluate control signal; a first stage controller responsive to a transition signal indicative of a phase of a second processing stage and configured to provide the first precharge control signal and first evaluate control signal to the first processing stage; a second processing stage configured to be driven through a cycle of phases consisting of a second precharge phase, followed by a second evaluate phase, and followed by a second isolate phase, solely in response to a second precharge control signal and a second evaluate control signal; a completion generator which is configured to provide the transition signal indicative of the phase of the second processing stage which is asserted upon completion of the second evaluate stage, wherein one interconnection is provided between the first processing stage and the second processing stage such that reception by the first stage controller of the transition signal indicative of the stage of the second processing stage enables the first processing stage to cycle through the precharge phase, the evaluate phase, and the isolate phase when the transition signal indicative of the phase of the second processing stage is asserted.
 2. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 1, further comprising: a first completion generator configured to provide a first transition signal indicative of the phase of the first processing stage to the first stage controller.
 3. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 2, wherein the first processing stage enters the first evaluate phase when the first stage controller asserts the first evaluate control signal and de-asserts the first precharge control signal.
 4. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 2, wherein the first processing stage enters the first isolate phase when the first stage controller de-asserts the first evaluate control signal and de-asserts the first precharge control signal.
 5. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 2, wherein the first processing stage enters the first precharge phase when the first stage controller de-asserts the first evaluate control signal and the first precharge control signal is asserted.
 6. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 2, wherein the first transition signal is asserted when the first processing stage has entered the evaluate phase and a previous stage has provided valid inputs to the first processing stage.
 7. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 2, wherein the first transition signal is de-asserted when the first processing stage has entered the precharge stage.
 8. A latchless dynamic asynchronous digital pipeline circuit comprising: a first processing stage configured to be driven through a cycle of phases consisting of a first precharge phase, followed by a first evaluate phase, and followed by a first isolate phase, solely in response to a first precharge control signal and a first evaluate control signal; a first stage controller responsive to a transition signal and providing the first precharge control signal and the first evaluate control signal to the first processing stage, wherein the first precharge control signal is decoupled from the first evaluate control signal; and a second processing stage configured to be driven through a cycle of phases consisting of a second precharge phase, followed by a second evaluate phase, and followed by a second isolate phase, solely in response to a second precharge control signal and a second evaluate control signal, wherein the output of the second processing stage is isolated from changes in the input thereof when in the second isolate phase, and wherein the second processing stage provides a transition signal indicative of the phase thereof to the first stage controller, wherein an interconnection is provided between the first processing stage and the second processing stage such that reception of the transition signal by the first stage controller enables the first processing stage to cycle through the first precharge phase, the first evaluate phase, and the first isolate phase while the second processing stage remains in the second isolate phase, such that first processing phase and the second processing phase may store two separate tokens without separation by a precharge phase.
 9. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 8, further comprising: a first completion generator configured to provide a first transition signal indicative of the phase of the first processing stage to the first stage controller.
 10. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 9, wherein the first processing stage enters the first evaluate phase when the first stage controller asserts the first evaluate control signal and de-asserts the first precharge control signal.
 11. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 9, wherein the first processing stage enters the first isolate phase when the first stage controller de-asserts the first evaluate control signal and de-asserts the first precharge control signal.
 12. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 9, wherein the first processing stage enters the precharge phase when the first stage controller de-asserts the first evaluate control signal and the first precharge control signal is asserted.
 13. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 9, wherein the first transition signal is asserted when the first processing stage has entered the evaluate phase and a previous stage has provided valid inputs to the first processing stage.
 14. The latchless dynamic asynchronous digital pipeline circuit as recited in claim 9, wherein the first transition signal is de-asserted when the first processing stage has entered the precharge stage.
 15. A processing stage in a latchless dynamic asynchronous digital circuit, having latchless dynamic logic comprising: a stage controller configured to provide a precharge control signal and an evaluate control signal, wherein the precharge control signal is decoupled from the evaluate control signal; and a first processing stage having an input and an output and configured to be driven through a cycle of phases consisting of a precharge phase, followed by an evaluate phase, followed by an isolate phase, and returning to the precharge phase solely in response to the precharge control signal and the evaluate control signal from the stage controller, wherein the output is isolated from changes in the input thereof when in the isolate phase.
 16. The processing stage in a latchless dynamic asynchronous digital pipeline circuit as recited in claim 15, wherein the first processing stage enters the first evaluate phase when the first evaluate control signal is asserted and the first precharge control signal is de-asserted.
 17. The processing stage in a latchless dynamic asynchronous digital pipeline circuit as recited in claim 15, wherein the first processing stage enters the first isolate phase when the first evaluate control signal is de-asserted and the first precharge control signal is de-asserted.
 18. The processing stage in a latchless dynamic asynchronous digital pipeline circuit as recited in claim 15, wherein the first processing stage enters the first precharge phase when the first evaluate control signal is de-asserted and the first precharge control signal is asserted.
 19. A latchless dynamic asynchronous digital pipeline circuit comprising: a first processing stage comprising a first function block comprising a pull-up stack controlled by a first precharge control signal, a pull-down stack controlled by a first evaluate control signal, a first data input and a first data output, configured to be driven through a cycle of phases consisting of a first precharge phase, followed by a first evaluate phase, and followed by a first isolate phase, solely in response to a first precharge control signal and a first evaluate control signal, wherein the first data output is isolated from a change in the first data input when the first precharge control signal and the second precharge control signal are de-asserted in the first isolate phase; a first completion generator comprising an asymmetric C-element configured to receive the first precharge control signal, the first evaluate control signal, and a valid data input signal as inputs, wherein the asymmetric C-element produces a first transition signal which is asserted when both the first evaluate control signal and the valid data input are asserted, and which is de-asserted when the first precharge control signal is asserted; a first matched delay element comprising a plurality of inverters configured to receive the first transition signal as an input and to produce a first delayed transition signal as an output; a first stage controller comprising an inverter configured to receive the first delayed transition signal from the first matched delay element as an input and to produce the first evaluate control signal as an output, an asymmetric C-element configured to receive the first delayed transition signal and a second transition signal from a second processing stage as inputs and to produce an intermediate control signal which is de-asserted when the first delayed transition signal is de-asserted and which is asserted when the first transition signal is asserted and the second transition signal is asserted, and a NAND-gate configured to receive the first delayed transition signal, the second transition signal and the intermediate control signal as inputs, and to produce the first precharge signal as an output; and a second processing stage comprising a second function block comprising a pull-up stack controlled by a second precharge control signal, a pull-down stack controlled by a second evaluate control signal, a second data input and a second data output, configured to be driven through a cycle of phases consisting of a second precharge phase, followed by a second evaluate phase, and followed by a second isolate phase, solely in response to a second precharge control signal and a second evaluate control signal, wherein the second data output is isolated from a change in the second data input when the second precharge control signal and the second precharge control signal are de-asserted in the second isolate phase; a second completion generator comprising an asymmetric C-element configured to receive the second precharge control signal, the second evaluate control signal, and the first delayed transition signal input as inputs, wherein the asymmetric C-element produces a second transition signal which is asserted when both the second evaluate control signal and the valid data input are asserted, and which is de-asserted when the second precharge control signal is asserted; a second matched delay element comprising a plurality of inverters configured to receive the second transition signal as an input and to produce a second delayed transition signal as an output; a second stage controller comprising an inverter configured to receive the second delayed transition signal from the second matched delay element as an input and to produce the second evaluate control signal as an output, an asymmetric C-element configured to receive the second delayed transition signal and a third transition signal from the environment as inputs and to produce an intermediate control signal which is de-asserted when the second delayed transition signal is de-asserted and which is asserted when the first delayed transition signal is asserted and the second transition signal is asserted, and a NAND-gate configured to receive the second delayed transition signal, the second transition signal and the intermediate control signal as inputs, and to produce the second precharge signal as an output.
 20. A method for latchless dynamic asynchronous digital pipeline processing with a latchless dynamic asynchronous pipeline, the method comprising: providing a first processing stage configured to be driven through a cycle of phases consisting of a first precharge phase, followed by a first evaluate phase, and followed by a first isolate phase, solely in response to a first precharge control signal and a first evaluate control signal; providing a second processing stage configured to be driven through a cycle of phases consisting of a second precharge phase, followed by a second evaluate phase, followed by a second isolate phase, solely in response to a second precharge control signal and a second evaluate control signal, wherein the output of the second processing stage is isolated from changes in the input thereof when in the second isolate phase; executing the first evaluate phase at the first processing stage; executing the second evaluate phase at the second processing stage and providing a transition signal indicative of the phase of the second processing stage to the first stage controller; providing the first precharge control signal and the first evaluate control signal to the first processing stage by the first stage controller in response to the transition signal; and executing the precharge phase, the first evaluate phase, and the first isolate phase of the first processing stage while the second processing stage remains in one of the second evaluate phase and the second isolate phase, such that first processing phase and the second processing phase may store two separate tokens without separation by a precharge phase.